Doubts about the generality of results produced by psychological research have been expressed with increasing frequency since Koch (1959) observed, after a monumental review of scientific psychology, that there is "a stubborn refusal of psychological findings to yield to empirical generalization" (pp. 729-788). Brunswik (1952, 1956), Campbell and Stanley (1966), Cronbach (1975), Epstein (1979, 1980), Einhorn and Hogarth (1981), Greenwald (1975, 1976), Hammond (1966), Meehl (1978), and Simon (1979), among others, have also called attention to this situation. Jenkins (1974) warned that "a whole theory of an experiment can be elaborated without contributing in an important way to the science because the situation is artificial and nonrepresentative" [Italics added] (p. 794). Tulving (1979) makes the startling observation that "after one hundred years of laboratory-based study of memory, we still do not seem to possess any concepts that the majority of workers would consider necessary or important" (p. 27). Nor is it unusual for reviewers of a body of literature to find, as Hastie and Park (in press) do, that over 50 studies have been carried out on a given topic without yielding a definite conclusion. Meehl (1978) summarized the consequences of the failure to develop generalizable findings by saying:

There is a period of enthusiasm about a new theory, a period of attempted application to several fact domains, a period of disillusionment as the negative data come in, a growing bafflement about inconsistent and unreplicable empirical results, multiple resort to ad hoc excuses, and then finally people just lose interest in the thing and pursue other endeavors (p. 807).

It is our view that this situation is caused by the lack of an analytical means for generalizing, and thus aggregating, results. Consequently, aggregation rests largely on researchers' intuitive judgments of what constitutes generality of results over conditions.


In an effort to develop an analytical methodology, and thus contribute to the development of a cumulative science, we build upon two previous methodological suggestions, one from the field of individual differences (the multitrait-multimethod (MTMM) matrix introduced by Campbell and Fiske, 1959), and one from the field of experimental psychology (the representative design of experiments introduced by Brunswik, 1956). Data from a study of experts who made judgments of three concepts by three different methods provided a unique opportunity not only to make use of each of these suggestions but to combine and extend them.


Research  Context 


The purpose of this study was to examine the relative efficacy of intuitive, quasi-rational, and analytical cognition. Twenty engineers judged the aesthetic value, safety, and capacity of 40 highways under three cognitive conditions. Each engineer's judgments were studied in each cell of the diagram presented in Figure 1. Intuition was induced by requiring each expert to judge each concept (aesthetics, safety, capacity) from film strips of one- to three-mile segments of each of the 40 highways. Quasi-rationality was induced by requiring each expert to judge each concept from bar graphs that presented the values of nine attributes for each highway. Analytical cognition was induced by requiring each engineer to construct a mathematical formula for each concept. An empirical criterion was available for each concept, and thus it was possible to evaluate the accuracy of each expert's judgment in relation to each concept. The criterion for the aesthetic value of each highway was the mean judgment of 91 citizens who judged the highway segments by rating the film strips, or by rating or ranking single frames from the film strips. The criterion for safety was the accident rate for each highway segment averaged over 7 years. The criterion for capacity was the figure calculated with the procedure of the Highway Capacity Manual (Highway Research Board, 1965). Each expert devoted roughly 20 hours to the nine sessions, which were separated by two-week intervals. (See Hammond, Hamm, Grassia, & Pearson, 1984, for details.)


Insert  Figure  1  about  here 


Plan of the Article

In what follows we first present a description of the Campbell/Fiske MTMM matrix. Second, we extend the MTMM matrix to include a "coherence validity" matrix. Third, we indicate how a "performance validity" matrix incorporates the essential element of Brunswik's representative design of experiments. Fourth, we show how the complementarity of the two matrices leads to the development of a measure of competence. Fifth, we illustrate how the use of this methodology provides an analytical means for evaluating the results of experiments. Sixth, we use these matrices to illustrate the analytical incompleteness and overgeneralization inherent in conventional methods of accumulating results.

Campbell and Fiske's Multitrait-Multimethod (MTMM) Matrix

In their 1959 article, Campbell and Fiske convincingly demonstrated the faults of the conventional single-concept, single-operation methodology in the field of individual differences. They showed that results of studies in this field were more likely to be determined by the methods employed than by the traits hypothesized to account for the results. Although they also showed that the failure to separate the effects of method from the effects of concept can be avoided by use of the MTMM matrix, there has been little change in conventional research methodology; current research in this area still fails to separate concept and method systematically (see, for example, Pervin, 1985).

The problem is not that Campbell and Fiske's work has gone unrecognized. It has become a milestone in the methodological literature of psychology, and by 1983 had been cited over 1000 times. Yet in spite of the potential of the MTMM matrix for breaking the grip of a simpleminded operationism on psychological research, the method is for the most part simply not used. Presumably researchers have avoided it for tactical reasons, since it introduces conceptual complexity (which concepts and which methods should be compared?) and requires considerable additional labor and apparatus within a single study (at least two methods and two concepts must be tested). Or perhaps researchers are generally unaware of the ephemeral character of results produced by single-concept, single-method operationism. Whatever the reason, among tens of thousands of studies of individual differences, Turner (cited in Fiske, 1982) found only 70 published matrices between 1967 and 1980 (see Fiske, 1982, for a general review). So far as we can ascertain, the MTMM approach is never used in experimental psychology. (Campbell and Fiske cite one exception, an experimental study that employed the MTMM matrix to examine individual differences. In that study also, "the highest correlations are found among different constructs from the same method, showing the dominance of apparatus or method factors so typical of the whole field of individual differences" (1959, p. 86).)

The MTMM matrix, presented in Table 1, is developed from a set of test scores taken from a group of subjects (Campbell & Fiske, 1959). Each trait-method combination is treated as a variable, and the variables are correlated across subjects. This illustration is based on data from "three different traits, each measured by three methods, generating nine separate variables" (1959, p. 82). The reliabilities are indicated in parentheses in the main diagonal. The convergent validity coefficients (monotrait-heteromethod), which are derived from measuring the same trait by different methods, are shown in the validity diagonals below the main diagonal (e.g., .57, .57, and .46 et seq.). A heterotrait-monomethod triangle lies below the reliability diagonal within each method block, and a heterotrait-heteromethod triangle lies to either side of each validity diagonal.
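
To make the layout of Table 1 concrete, the following minimal Python sketch labels each cell in the lower triangle of a matrix built from three traits measured by three methods. The trait labels (A, B, C), the method labels (1, 2, 3), and the method-major ordering of the nine variables are assumptions for illustration; the sketch encodes only the region definitions given above and uses none of Campbell and Fiske's data.

```python
from collections import Counter

TRAITS = ["A", "B", "C"]   # hypothetical trait labels
METHODS = [1, 2, 3]        # hypothetical method labels

# Variables in method-major order: A1, B1, C1, A2, B2, C2, A3, B3, C3
VARIABLES = [(trait, method) for method in METHODS for trait in TRAITS]


def cell_type(i: int, j: int) -> str:
    """Name the MTMM region that cell (i, j) belongs to."""
    trait_i, method_i = VARIABLES[i]
    trait_j, method_j = VARIABLES[j]
    if i == j:
        return "reliability"                # main diagonal
    if trait_i == trait_j:
        return "validity"                   # monotrait-heteromethod
    if method_i == method_j:
        return "heterotrait-monomethod"
    return "heterotrait-heteromethod"


def lower_triangle_counts() -> Counter:
    """Count cells of each type in the lower triangle, diagonal included."""
    n = len(VARIABLES)
    return Counter(cell_type(i, j) for i in range(n) for j in range(i + 1))


if __name__ == "__main__":
    print(lower_triangle_counts())
```

Of the 45 lower-triangle cells (diagonal included), 9 are reliabilities, 9 are validity values, 9 are heterotrait-monomethod correlations, and the remaining 18 are heterotrait-heteromethod correlations.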


Generalizing  over  Conditions 


Page  8 
2  January  1986 


Insert  Table  1  about  here 


Campbell and Fiske (1959) note that "a validity value for a variable should be higher than the correlations obtained between that variable and any other variables having neither trait nor method in common. This requirement may seem so minimal and so obvious as to not need stating, yet an inspection of the literature shows that it is frequently not met, and may not be met even when the validity coefficients are of substantial size" (pp. 82-83). Thus, Campbell and Fiske (1959) introduce not only the concept of convergent validity but also that of discriminant validity. A trait should not only be measured by results from different methods that converge upon it, but also be discriminated from its rivals.
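
The first criterion can be checked mechanically. The sketch below is a minimal illustration, assuming a symmetric 9 by 9 correlation matrix whose variables are ordered method-major (traits A, B, C within methods 1, 2, 3); `toy_matrix` is a hypothetical helper that fills each region with a constant, not real data. For each validity value it collects the heterotrait-heteromethod correlations that share its row or column within the same heteromethod block and tests whether the validity value exceeds all of them.

```python
import numpy as np

TRAITS = ["A", "B", "C"]   # hypothetical labels
METHODS = [1, 2, 3]
VARIABLES = [(t, m) for m in METHODS for t in TRAITS]


def toy_matrix(validity=0.6, mono=0.4, hetero=0.2):
    """Hypothetical symmetric MTMM matrix with one constant value per region."""
    n = len(VARIABLES)
    r = np.eye(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (ti, mi), (tj, mj) = VARIABLES[i], VARIABLES[j]
            if ti == tj:
                r[i, j] = validity      # monotrait-heteromethod
            elif mi == mj:
                r[i, j] = mono          # heterotrait-monomethod
            else:
                r[i, j] = hetero        # heterotrait-heteromethod
    return r


def first_criterion(r):
    """Check Campbell and Fiske's first criterion for every validity value."""
    n = len(VARIABLES)
    results = {}
    for i in range(n):
        for j in range(i):
            (ti, mi), (tj, mj) = VARIABLES[i], VARIABLES[j]
            if ti != tj or mi == mj:
                continue  # not a validity value
            # Row i: variable i against the other traits measured by method mj.
            rivals = [r[i, k] for k in range(n)
                      if VARIABLES[k][1] == mj and VARIABLES[k][0] != ti]
            # Column j: variable j against the other traits measured by method mi.
            rivals += [r[k, j] for k in range(n)
                       if VARIABLES[k][1] == mi and VARIABLES[k][0] != tj]
            results[(VARIABLES[i], VARIABLES[j])] = bool(r[i, j] > max(rivals))
    return results


if __name__ == "__main__":
    for pair, passed in first_criterion(toy_matrix()).items():
        print(pair, passed)
```

Raising a single heterotrait-heteromethod value above the validity value it flanks (for example, the A2-B1 cell above the A2-A1 validity value) makes the check fail for that validity value, which is exactly the situation Campbell and Fiske report finding in the literature.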

The value of this methodology has been widely recognized (see, e.g., Brewer & Collins, 1981; Fiske, 1982), and its application will yield definite and useful conclusions regarding the validity of psychological traits. (See Widaman, 1985, who reviews criticisms of the MTMM methodology as originally proposed and offers a rigorous procedure for evaluating MTMM matrices; see Schmitt, Coyle, & Saari, 1977; Kenny, 1979; Marsh & Hocevar, 1983, for the use of confirmatory factor analytic methods to analyze the MTMM matrix. See Schmitt, 1978, for the use of path analysis for evaluating an MTMM matrix; see Farh, Hoffman, & Hegarty, 1984, for a recent application.) The results from such a matrix will have populational and task generality insofar as the trait domain, the apparatus/method domain, and the subject domain have been adequately sampled. The results therefore establish the construct validity of the traits investigated, separate from the methods used, within the constraints chosen by the investigator.

Extension of the Campbell/Fiske Approach

In this research context we extend Campbell and Fiske's MTMM matrix from the evaluation of the construct validity of certain (a) traits within the study of (b) individual differences based on (c) group data to (a) the construct validity of judgments of concepts in a coherence validity matrix, (b) a performance validity matrix that incorporates criterion measures (and their intra-ecological correlations) for each concept, and (c) the behavior of the individual rather than of the group, although group data can be analyzed as well. (See Hammond, McClelland, & Mumpower, 1980, pp. 115-127, on the advantages of single-subject analysis; also Meehl, 1978, and Serlin & Lapsley, 1985, on the problematic nature of conventional between-group comparisons.)

The coherence validity matrix. By extending the multitrait-multimethod procedure to evaluate the construct validity of an individual's judgments of concepts (as in the present study of highway engineers), we can determine whether the individual has indeed mastered each concept and is able (a) to use it across different methods (convergent validity) and (b) to differentiate it from other concepts (discriminant validity). The analysis of the construct validity of an individual's judgments by means of such a multiconcept-multimethod matrix is analogous to the analysis of the traditional Campbell/Fiske MTMM matrix, with "concepts" substituted for "traits" in Table 1. Because the coherence of the individual's judgments of concepts can be ascertained from a multiconcept-multimethod matrix, it is called a "coherence validity matrix." Of course, the coherence matrix can be applied to individual behavior other than judgments. Problem-solving behavior, memory, and similar cognitive functions, as well as psychomotor functions, can also be evaluated by means of this matrix.

The use of the coherence validity matrix. Data for a coherence validity matrix based on our study of highway engineers are presented in Table 2. The data for the matrix were generated from the means of the 20 engineers' judgments of each of the 40 highways for each concept-method pair. (Note: Fisher's z-transformation was used throughout this study in averaging correlation coefficients.) Thus, the matrix illustrates the particulars of the behavior of an artificial engineer constructed from the mean judgments of this group. Data from the artificial engineer are presented mainly to illustrate the use of the method; no inferences can be drawn from the matrix in Table 2 to a matrix generated for any one engineer. Illustrations of individual matrices are provided in Hammond, Hamm, and Grassia (1984).
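
A minimal Python sketch of how such a matrix could be assembled, assuming the judgments are stored as an array of shape (engineers, highways, conditions) with the nine concept-method conditions on the last axis, and assuming raw ratings are averaged without any rescaling across engineers. The random data are purely hypothetical stand-ins for the engineers' judgments.

```python
import numpy as np


def artificial_engineer(judgments):
    """Average raw judgments over engineers.

    judgments: array of shape (engineers, highways, conditions), with the
    nine concept-method conditions on the last axis (assumed layout).
    Returns an array of shape (highways, conditions).
    """
    return judgments.mean(axis=0)


def coherence_matrix(judgments):
    """Correlate the condition columns of the artificial engineer (9 x 9)."""
    return np.corrcoef(artificial_engineer(judgments), rowvar=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hypothetical = rng.normal(size=(20, 40, 9))  # 20 engineers, 40 highways, 9 conditions
    print(np.round(coherence_matrix(hypothetical), 2))
```

For an individual engineer's matrix, the same function can be given a single engineer's slice, e.g. `coherence_matrix(data[k:k + 1])`, since averaging over one engineer leaves that engineer's judgments unchanged.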


Insert  Table  2  about  here 


Each of the descriptions of the matrix presented by Campbell and Fiske (1959) applies to the matrix in Table 2. The three validity diagonals contain values that provide evidence regarding convergent validity, and the heteroconcept triangles adjacent to them provide evidence regarding discriminant validity.


Brunswik's  Representative  Design 


Brunswik's (1943, 1952, 1956) argument that generalization over conditions requires the representation of ecological conditions in the design of experiments must also be considered a milestone in the methodological literature of psychology (see Hammond & Wascoe, 1980, for some examples of the use of representative designs). Brunswik's work has also been cited over 1000 times, yet representative designs are seldom employed. Presumably, the same reasons that lead students of individual differences to forgo the use of the MTMM matrix also lead experimental psychologists to forgo the use of representative design: both are more difficult and time-consuming to execute than standard experiments, in which the central effort (clear separation of one condition from another) is carried out without regard to the arrangement of conditions in the organism's natural habitat. As Brunswik (1956) observed, however, "generalization of results concerning the relative weights of the variables involved must remain limited unless at least the range, but better also the distribution of the 'levels of strength' employed for each variable, has been made representative of a carefully defined universe of conditions" (p. 55). "Carefully defining a universe of conditions" and representing them is far more difficult than merely separating them according to design requirements, yet that is what generalization requires.

The Performance Validity Matrix

Table 3 presents a further extension of the Campbell/Fiske MTMM matrix. By linking the nine concept-method conditions with criterion measures for the three concepts, it becomes possible to evaluate the performance of each engineer across different concepts and different methods. The three validity diagonals contain monoconcept correlations between each set of judgments (one for each method) and the criterion for the same concept. For example, in the upper left-hand corner of Table 3, .855 is the correlation between the artificial engineer's film-strip judgments of the aesthetic value of each highway and the aesthetic criterion value for each highway. The triangles consist of heteroconcept correlations between the judgments made in each concept-method condition and the criterion for a different concept. Thus, the number .016 (just below .855) represents the correlation between the film-strip judgments of aesthetics for each highway and the safety criterion for each highway. Because this multiconcept-multimethod matrix can be used to evaluate an individual's performance when his/her judgments are compared with a criterion measure, we call it a "performance validity matrix."
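
The construction can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the artificial engineer's judgments form a (highways, 9) array whose columns run method-major (film strip, bar graph, formula; aesthetics, safety, capacity within each), and the criteria form a (highways, 3) array in the concept order aesthetics, safety, capacity. The function names are hypothetical, and the example data in the usage block are random.

```python
import numpy as np

METHODS = ["film strip", "bar graph", "formula"]
CONCEPTS = ["aesthetics", "safety", "capacity"]


def performance_matrix(judgments, criteria):
    """Correlate each concept-method judgment column with each criterion.

    judgments: (highways, 9) array, columns ordered method-major.
    criteria:  (highways, 3) array of criterion values.
    Returns a (9, 3) array; entry [3*m + c, k] correlates judgments of
    concept c made by method m with the criterion for concept k.
    """
    n_judgment = judgments.shape[1]
    full = np.corrcoef(np.hstack([judgments, criteria]), rowvar=False)
    return full[:n_judgment, n_judgment:]


def validity_diagonals(p):
    """Monoconcept (validity) entries as a methods x concepts array."""
    return np.array([[p[3 * m + c, c] for c in range(3)] for m in range(3)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    criteria = rng.normal(size=(40, 3))  # hypothetical criterion values
    # Hypothetical judgments: criteria plus method-specific noise.
    judgments = np.hstack([criteria + rng.normal(scale=s, size=criteria.shape)
                           for s in (1.0, 0.5, 1.5)])
    print(np.round(validity_diagonals(performance_matrix(judgments, criteria)), 3))
```

Off-diagonal entries of each 3 by 3 method block are the heteroconcept correlations described above (judgments of one concept against the criterion of another).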


Insert  Table  3  about  here 


The coefficients in the performance validity matrix in Table 3 differ from those in the coherence validity matrix in that each correlation in the performance validity matrix is between judgments and measures of a criterion rather than between two sets of responses. Aside from this very important difference, the interpretation of the coefficients with respect to the questions of convergent and discriminant validity is quite similar. As in the coherence validity matrix, sufficiently large correlations in the validity diagonal are evidence of convergent validity. In Table 3 the coefficients in the diagonals within each method block show the convergent validity of the judgment of each concept by that method. Comparison of the average of these diagonal values across the three concepts indicates the relative external convergent validity of each method. The heteroconcept triangles consist of the correlations of an expert's judgments of one concept (by a particular method) with the criterion measure of a different concept. Evidence of discriminant validity exists when a value in a validity diagonal is higher than the values lying in its column and row in the heteroconcept triangles. (Precise tests of discriminant validity for group data are described in Widaman, 1985; see also Hammond, Hamm, & Grassia, 1984.)

The Use of the Performance Validity Matrix

Convergent validity of concepts. In Table 3, the validity correlation coefficient for the artificial engineer's aesthetics judgments is .855 by the film-strip method, .945 by the bar-graph method, and .951 by the formula method, producing a mean external convergent validity across all three methods of .926 for aesthetic judgments. Averaging the validity correlations pertaining to safety from the three method blocks gives a mean convergent validity (Pearson r) of .568; similarly, averaging the judgment-criterion correlations for capacity produces a mean convergent validity of .530. In short, the data suggest that, irrespective of the method used, the artificial engineer judged highway aesthetics more accurately than highway safety or capacity, and judged safety and capacity with about equal accuracy.
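
The averages above can be reproduced with Fisher's z-transformation: transform each r with atanh, take the arithmetic mean, and transform back with tanh. A minimal Python sketch follows; the per-method safety and capacity values are taken from the method diagonals listed in the next paragraph, assuming the within-block order aesthetics, safety, capacity used throughout the text.

```python
import math


def fisher_mean(rs):
    """Average correlations via Fisher's z: atanh each r, average, tanh back."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# Validity coefficients for the artificial engineer, ordered
# (film strip, bar graph, formula) for each concept.
VALIDITY_BY_CONCEPT = {
    "aesthetics": (0.855, 0.945, 0.951),
    "safety": (0.702, 0.683, 0.226),
    "capacity": (0.291, 0.833, 0.266),
}

if __name__ == "__main__":
    for concept, rs in VALIDITY_BY_CONCEPT.items():
        print(f"{concept:<10} {fisher_mean(rs):.3f}")
```

Run as written, the three means fall within about .001 of the reported .926, .568, and .530 (the small residuals presumably reflect rounding of the tabled coefficients). A plain arithmetic mean of the aesthetics values would instead give .917, so the reported figures are consistent with z-averaging.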

Convergent validity of methods. A measure of the performance convergent validity of each method may be calculated by averaging the judgment-criterion correlations within each of the diagonals (.855, .702, .291; .945, .683, .833; and .951, .226, .266), thus obtaining performance validities for the film strip, bar graph, and formula methods (.672, .854, and .654, respectively). These results suggest that the artificial engineer judged these three concepts most accurately using the bar graph method. Finally, the mean of the latter three coefficients is .742. This measure is informative because it may be used to compare one group of experts with another, to compare one individual with another (when a matrix is constructed for each individual), or to evaluate the effect of a change in condition in either case. Moreover, the referential domain of this measure is clear; it is general over the three methods and three concepts employed in the study, as well as the group of engineers selected.
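
The per-method and overall figures follow from the same Fisher-z averaging. The sketch below, a minimal illustration, redefines the averaging helper so that it stands alone; the assignment of values within each diagonal to aesthetics, safety, and capacity follows the order used in the text.

```python
import math


def fisher_mean(rs):
    """Average correlations via Fisher's z: atanh each r, average, tanh back."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# Validity diagonal of each method block, as listed in the text
# (order within each block: aesthetics, safety, capacity).
VALIDITY_BY_METHOD = {
    "film strip": (0.855, 0.702, 0.291),
    "bar graph": (0.945, 0.683, 0.833),
    "formula": (0.951, 0.226, 0.266),
}

method_validity = {m: fisher_mean(rs) for m, rs in VALIDITY_BY_METHOD.items()}
overall_validity = fisher_mean(method_validity.values())

if __name__ == "__main__":
    for method, r in method_validity.items():
        print(f"{method:<11} {r:.3f}")
    print(f"overall     {overall_validity:.3f}")
```

Because each method contributes exactly three coefficients, averaging the three method means in z units gives the same result as averaging all nine z values directly, so the overall figure does not depend on the order of aggregation.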

Discriminant validity. The performance validity matrix provides several measures of discriminant validity. One is analogous to Campbell and Fiske's (1959) first method for calculating discriminant validity from the multitrait-multimethod validity matrix (see above). Other measures, however, have the advantage of taking into account the intra-ecological correlations.

Measuring performance discriminant validity using a procedure analogous to Campbell and Fiske's. The discriminant validity of a specific concept-method unit can be determined using the first measure, by comparing its validity coefficient with the heteroconcept coefficients that include the concept of interest. Thus, for example, in Table 3 the magnitude of the validity correlation for the aesthetics judgment made by the film-strip method (.855) can be compared with the magnitudes of the correlations between the aesthetics criterion and the safety and capacity judgments (-.362 and -.473, respectively), as well as with the size of the correlations between aesthetics judgments and the safety and capacity criteria (.016 and -.172, respectively). Although subjective appraisals of such comparisons may suffice, objective comparisons may be provided by subtracting the mean of the four heteroconcept correlations from the monoconcept correlation of interest (after appropriate z-transformations and sign changes; see discussion of Hypothesis 2, below).
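
One possible reading of this subtraction is sketched below in Python. The exact sign rule is deferred to the discussion of Hypothesis 2, so the sketch rests on hypothetical assumptions: each heteroconcept r is z-transformed and its sign dropped (absolute value), the mean of those values is subtracted from the z of the monoconcept r, and the difference is transformed back with tanh. Under these assumptions it is not expected to reproduce the .598 reported for the film-strip method exactly.

```python
import math


def first_discriminant_measure(mono_r, hetero_rs):
    """Hypothetical version of the first performance discriminant measure.

    Assumptions: heteroconcept correlations enter as |atanh(r)|, their
    mean is subtracted from atanh(mono_r), and the result is returned
    on the r scale via tanh.
    """
    z_mono = math.atanh(mono_r)
    z_hetero = [abs(math.atanh(r)) for r in hetero_rs]
    return math.tanh(z_mono - sum(z_hetero) / len(z_hetero))


# Film-strip aesthetics example: the validity r and the four
# heteroconcept correlations quoted in the text.
film_aesthetics = first_discriminant_measure(0.855, [-0.362, -0.473, 0.016, -0.172])

if __name__ == "__main__":
    print(round(film_aesthetics, 3))
```

When all four heteroconcept correlations are zero the measure simply returns the validity coefficient, and it shrinks as the heteroconcept correlations grow in magnitude in either direction.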

Averaging the performance discriminant validity measures for the artificial engineer's aesthetics judgments across the film strip (.598), bar graph (.611), and formula (.736) methods produces a measure of .653 for the artificial engineer's performance discriminant validity with regard to aesthetics; similarly, for safety the performance discriminant validity is .178, and for capacity, .115. Analogous procedures produce measures of the performance discriminant validity of each method.
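
The .653 for aesthetics is consistent with Fisher-z averaging of the three per-method measures; a plain arithmetic mean would give about .648. A minimal Python check:

```python
import math


def fisher_mean(rs):
    """Average correlation-scale values via Fisher's z and transform back."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# Aesthetics discriminant measures for film strip, bar graph, and formula.
aesthetics_discriminant = fisher_mean([0.598, 0.611, 0.736])

if __name__ == "__main__":
    print(f"{aesthetics_discriminant:.3f}")
```

The per-method values behind the safety (.178) and capacity (.115) figures are not given here, so they are not recomputed.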

Measuring discriminant validity with reference to intra-ecological correlations. The correlation between the criterion measures of the concepts provides a standard against which to compare the heteroconcept correlations between the expert's judgment of one concept and the criterion measure of a different concept. For example, since the correlation between the aesthetics criterion and the safety criterion is -.275, it is appropriate for an engineer's judgments of aesthetics to correlate -.275 with the safety criterion. Similarly, since the correlation between two criterion measures may be low (as for safety and capacity, .180), the corresponding heteroconcept correlations should also be low. (See the left-hand side of Table 3 for the intra-ecological correlations of the criteria.) In short, the observed correlations between judgments of aesthetics, safety, and capacity for an engineer are not to be compared to a standard of zero (an arbitrary demand for complete independence regardless of task conditions) but to a standard that is representative of task conditions, if we are properly to evaluate the discriminant validity of the judgments of these concepts with these methods.

To "untie" these variables, in other words to force them to be orthogonal to one another, is (a) to invite the engineer to judge an unrepresentative set of conditions and thus (b) to extrapolate the results illegitimately from irrelevant conditions to the relevant ones. These two tactics have an embarrassingly long history in psychology; they are customarily explained away by arguments that "this is the best we can do" and/or "it doesn't matter, anyway." Neither argument is correct, and neither tactic is necessary: the performance validity form of the multiconcept-multimethod matrix makes it possible to evaluate the competence of experts (or other subjects) in relation to the task conditions to which their judgments are to be applied.

Discriminant validity of concept pairs. A performance discriminant validity measure that relates judgments to intra-ecological correlations is therefore preferable to the first performance discriminant validity measure, described above. The performance discriminant validity of a pair of concepts (e.g., aesthetics and safety) can be determined with this second measure as follows. Each of the correlations in Table 3 involving aesthetics and safety (-.362, .016, -.479, -.233, -.226, and -.313) and the intra-ecological correlation between the criterion measures of aesthetics and safety (-.275) are z-transformed. The difference between each aesthetics-safety correlation and the intra-ecological correlation is computed. The absolute values of these differences are averaged, and the mean, .129 for the aesthetics-safety example, is an index of performance discriminant validity. The corresponding index for aesthetics and capacity is .121, and for safety and capacity, .302. Because larger values indicate larger departures from the ecological standard, this indicates that the safety and capacity concepts are discriminated least accurately.
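
The steps just described translate directly into a short Python function. This is a minimal sketch; the function name is hypothetical, and the index is left in z units (not transformed back), which is what reproduces the reported .129 for the aesthetics-safety pair.

```python
import math


def pair_discriminant_index(judgment_criterion_rs, intra_ecological_r):
    """Mean absolute z-distance of heteroconcept correlations from the ecological r.

    Larger values mean the judgment-criterion correlations depart further
    from the correlation between the two criteria themselves.
    """
    z_eco = math.atanh(intra_ecological_r)
    deviations = [abs(math.atanh(r) - z_eco) for r in judgment_criterion_rs]
    return sum(deviations) / len(deviations)


# The six aesthetics-safety correlations from Table 3 quoted in the text,
# and the correlation between the aesthetics and safety criteria.
AESTHETICS_SAFETY = [-0.362, 0.016, -0.479, -0.233, -0.226, -0.313]
index = pair_discriminant_index(AESTHETICS_SAFETY, -0.275)

if __name__ == "__main__":
    print(f"{index:.3f}")
```

A judge whose heteroconcept correlations all matched the ecological correlation exactly would score zero, the best possible value on this index.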

Complementarity of Coherence and Performance Validity Matrices Leads to a Measure of Competence

The distinction between coherence and performance is intended to parallel the traditional distinction between the coherence and correspondence theories of truth (see, e.g., White, 1967, and Prior, 1967). The coherence theory focuses on the extent to which statements of facts or judgments put forward cohere, or "hang together," with one another; that is, are related by logical implication. Like the coherence theory of truth, the coherence validity matrix demands logical rather than external justification. Although the coherence matrix does include empirical, factual material, no reference to empirical criteria outside the matrix itself is required to establish the construct validity of a set of psychological concepts. All that is required is that a logical criterion be met, namely, that convergent validities be high and heteroconcept correlations be low (that is, discriminant validity should be high).

The correspondence theory of truth, on the other hand, is concerned with the extent to which our beliefs about the world correspond to independently determined facts. Therefore an independent measure of the concepts in question is required in order to test the correspondence between what a theory predicts and what exists. The performance validity matrix thus parallels the correspondence theory of truth in that it demands the evaluation of the performance of a theory; it demands the evaluation of the empirical correspondence between psychological concepts and some independent measure of the concepts. (See Einhorn & Hogarth, 1982, for a similar treatment of the components of expert judgment in which "truth," our "coherence," combined with "accuracy," our "performance," constitutes "justifiability.")

Because both matrices can be developed for a single subject (and aggregated for group analyses), it is possible to combine the results from each matrix into a single measure to provide a higher order indicator of each individual's judgment that we shall call "competence" (see also McClelland, 1973). Since we derive the measure of competence from measures of coherence and performance that are based on variations in both method and concept, our derivation copes directly with the problem of generalization. In the present case, for example, the conclusions about an expert's coherence and performance, and thus competence, are clearly based on, and thus limited to, his behavior over the three methods and three concepts employed in the study. (J. R. Kirwan, personal communication, December 1985, has innovatively applied this method to physicians' judgments.)

Indices of Coherence, Performance and Competence

Individual differences among engineers can be studied using numerical measures of convergent and discriminant validity derived from both the coherence and performance validity matrices. Procedures for evaluating validity can be converted into numerical measures by adding (or subtracting, as appropriate) the correlations in all the relevant cells in the matrix. These measures can be combined into an index of overall coherence validity, which indicates the coherence of the engineer's judgments; the corresponding index of performance validity indicates the correspondence between the engineer's judgments and reality. The mean of these two indices provides a measure of the engineer's overall competence. Each index can be produced at different levels of aggregation (e.g., for each concept or for each pair of methods), thus allowing numerical comparisons among these indices at each level. (The formulas for producing these indices can be found in …)
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The  treatment  of  coherence  and  performance  as  separate  cognitive 
functions  thus  will  allow  us  to  examine  empirically  questions  of 
fundamental  importance  to  both  philosophers  and  psychologists.  For 
example. 


1.  Does  high  competence  always  imply  both  high  coherence  and  high 
performance?  Are  coherence  and  performance  always  functionally 
interdependent?  Can  one  exist  in  the  absence  of  the  other? 

2.  Should  the  measures  of  coherence  and  performance  be  equally 
weighted  components  of  competence?  Currently,  different 
approaches  to  the  study  of  cognition  weight  these  components  of 
competence  differently.  For  example,  students  of  artificial 
intelligence  and  problem  solving  weight  the  experts'  coherence 
(and  the  coherence  of  the  computer  program  that  simulates  the 
expert)  very  highly  while  virtually  ignoring  performance. 
Students  of  judgment  and  decision  making  do  the  opposite 
(Hammond  et  al.,  1980).  These  two  fields  of  research  are 
apparently  investigating  complementary  aspects  of  competence 
among  experts. 

A fairly high correlation (.60) was found between coherence and
performance among the experts in the example used here.

Application to Experimental Psychology

In  basic  research,  the  results  of  experiments  are  expected  to  be 
general,  that  is,  not  contingent  upon  the  use  of  a  single  method.  In 
addition,  it  is  expected  that  the  experiments  will  show  that  the 
concepts  employed  to  describe  the  results,  and  not  others,  are  indeed 
responsible  for  the  results.  We  indicate  how  the  construction  of  the 
coherence  and  performance  validity  matrices  provides  a  systematic 
approach  to  the  testing  of  hypotheses  regarding  these  desiderata. 

Coherence Validity Matrix

The  first  requirement  is  that  a  coherence  validity  matrix,  similar 
to  those  in  Tables  2  and  3  above,  be  produced  for  each  engineer,  and 
that  convergent  and  discriminant  validities  be  determined  for  each. 

Convergent validity. The coherence convergent validity measure
(monoconcept heteromethod correlations between judgments of the same
concept using different methods) can be used to test the primary
hypothesis of interest, thus:

H1: Each theoretical concept has empirical meaning
independent of a specific method, i.e., there is
convergent validity for each concept across methods and
within an appropriate sample of subjects.


Hypothesis 1 was tested by asking whether, for each subject,
judgments of a concept covary, independently of the methods used to make
the judgments. For example, for the artificial engineer (Table 2) the
correlation between the film strip and bar graph methods for the
aesthetics concept is .890; for the film strip and formula methods,
.864; and for the bar graph and formula methods, .985. The overall
convergent validity for aesthetics is the mean of these correlations
(z-transformed), .938, which is significant at p < .001. In addition to
the matrix for the artificial engineer, a matrix was developed for each
of the 20 engineers individually, and this procedure was carried out for
each of the three concepts. All 20 engineers had significant positive
convergent validities for aesthetics, 16 for safety, and 17 for
capacity. Hence we conclude that each of the three concepts is capable
of being measured by appropriate subjects independently of the method
used; generality has been achieved over three methods. The generality
of results regarding more specific hypotheses may also be addressed;
for examples, see Hammond, Hamm, and Grassia (1984).
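As a check on the arithmetic, the sketch below reads "the mean of these correlations (z-transformed)" as: apply Fisher's z-transformation (z = artanh r) to each correlation, average the z values, and transform the mean back to a correlation. That reading reproduces the reported .938 for aesthetics. The function name is ours, and the significance test is not reproduced.

```python
import math

def mean_correlation(rs):
    """Average correlations through Fisher's z: z = atanh(r), take the
    arithmetic mean of the z values, then back-transform with tanh."""
    mean_z = sum(math.atanh(r) for r in rs) / len(rs)
    return math.tanh(mean_z)

# Monoconcept-heteromethod correlations for aesthetics, artificial engineer
# (Table 2): film strip/bar graph, film strip/formula, bar graph/formula.
aesthetics = [0.890, 0.864, 0.985]
print(round(mean_correlation(aesthetics), 3))  # 0.938
```

Averaging in the z metric rather than averaging the raw correlations keeps a very high value such as .985 from being understated.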

Discriminant  validity.  Convergent  validity  informs  us  about  the 
covariance  of  judgments  across  methods,  and  thus  about  the  status  of  a 
concept  independent  of  the  method  used  to  measure  it.  In  addition, 
however,  we  need  to  know  whether  the  concept  is  discriminable  from  other 
proposed  theoretical  entities.  Campbell  and  Fiske  (1959)  gave  first 
priority  to  this  test;  for  although  many  people  would  think  it  "so 
minimal  and  obvious  as  not  to  need  stating,"  (p.  82)  they  observed  that 
it  often  fails  to  be  true.  The  coherence  discriminant  validity  analysis 
employed  in  the  examples  below  compares  monoconcept  heteromethod 
correlations  to  heteroconcept  heteromethod  correlations  and  thus  allows 
an  evaluation  of  the  discriminability  of  each  concept.  It  also  permits 
a  more  detailed  investigation,  thus: 


Generalizing  over  Conditions 


Page  23 
2  January  1986 


H2:  All  pairs  of  concepts  are  equally  discriminable. 

This hypothesis was tested by calculating an index for each concept
pair for each engineer, and looking for evidence of any concept pair
being more, or less, discriminable than the others for a statistically
significant number of engineers. To illustrate the calculation of the
index for the aesthetics and safety concepts for the artificial engineer
of Table 2, we compare the correlations from the validity (monoconcept
heteromethod) diagonals that involve either aesthetics (.890, .864,
.985) or safety (.713, .393, .422) with the correlations from the
heteroconcept heteromethod triangles that involve both concepts (.283,
.244, .360, .093, .548, and .209). (The sign on all heteroconcept
correlations involving aesthetics was reversed because the
intra-ecological correlations between the criterion measures of
aesthetics and safety, and of aesthetics and capacity, were negative.)
In order to aggregate these comparisons into an index, we subtract the
mean of the z-transformations of the second set of correlations (.306)
from the mean of the z-transformations of the first set (1.155), which
produces an index (.849) of the discriminability of the aesthetics and
safety concepts. The corresponding index for aesthetics and capacity is
.913; for safety and capacity, -.047. Thus, for the artificial engineer,
aesthetics and capacity are the easiest concepts to discriminate, and
safety and capacity are the most difficult to discriminate, a result
that has practical implications.
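The discriminability index just described is a difference between two mean Fisher z values. The sketch below assumes, as the parenthetical note states, that the signs of the heteroconcept correlations have already been adjusted before they are passed in; with the artificial engineer's values it gives the reported means (1.155 and .306) and index (.849). The function names are illustrative.

```python
import math

def mean_z(rs):
    """Arithmetic mean of Fisher z-transformed correlations (z = atanh r)."""
    return sum(math.atanh(r) for r in rs) / len(rs)

def discriminability_index(validity_rs, heteroconcept_rs):
    """Mean z of the validity-diagonal (monoconcept heteromethod) correlations
    minus mean z of the heteroconcept heteromethod correlations.
    Heteroconcept correlations are expected with signs already adjusted."""
    return mean_z(validity_rs) - mean_z(heteroconcept_rs)

# Artificial engineer (Table 2), aesthetics vs. safety, values from the text
validity = [0.890, 0.864, 0.985,   # aesthetics diagonal
            0.713, 0.393, 0.422]   # safety diagonal
hetero = [0.283, 0.244, 0.360, 0.093, 0.548, 0.209]

print(round(mean_z(validity), 3), round(mean_z(hetero), 3))  # 1.155 0.306
print(round(discriminability_index(validity, hetero), 3))    # 0.849
```

A large positive index means the concepts' own cross-method agreement far exceeds their cross-concept agreement; an index near zero, like the -.047 for safety and capacity, means the two concepts are hardly distinguished.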
This index of discriminant validity is calculated for each concept
pair from each subject's matrix, and the order among concept pairs is
determined. For all 20 engineers, the safety and capacity concepts were
least discriminable (Chi-squared = 37.053, p < .001). Therefore null
Hypothesis 2 is rejected, for the engineers' judgments of safety and
capacity are more similar to each other than either is to their
judgments of aesthetics.
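The section does not say which test produced the chi-squared values. One test that reproduces them is a 1-df goodness-of-fit chi-square with Yates's continuity correction, comparing the number of engineers who showed a given outcome with the one-in-three rate expected by chance when one of three concept pairs is singled out. Under that assumption, 18 of 20 gives 26.4 (the value reported for Hypothesis 4), and 20 of 20 gives 37.056, within rounding of the reported 37.053. The sketch is a plausible reconstruction, not the authors' documented procedure.

```python
import math

def yates_chi_square(k, n, p=1/3):
    """1-df goodness-of-fit chi-square with Yates's continuity correction:
    k of n engineers show a particular outcome whose chance probability is p
    (1/3 when one of three concept pairs is singled out)."""
    observed = [k, n - k]
    expected = [n * p, n * (1 - p)]
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))

for k in (20, 18):
    chi2 = yates_chi_square(k, 20)
    print(k, round(chi2, 3), p_value_1df(chi2) < 0.001)
# 20 37.056 True
# 18 26.406 True
```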

Performance  Validity  Analysis 

Information  beyond  coherence  validity  is  necessary  for  experimental 
confirmation  of  hypotheses  regarding  the  phenomena  of  interest,  in  this 
case,  judgments  of  aesthetics,  safety  and  capacity. 

Convergent validity. In the present case, measures of performance
convergent validity can be based on the correlation between an
engineer's judgments of a concept and the criterion measure of that
concept. Rather than simply testing hypotheses regarding the
performance convergent validity of each concept across methods, we test
a more informative hypothesis, the relative convergent validity of each
concept, thus:

H3:  No  concept  has  higher  or  lower  performance  convergent 
validity  than  any  other. 

Hypothesis 3 is tested by averaging the z-transforms of the correlations
for each concept across methods, and then comparing the averages for
each concept. The aesthetics concept had higher convergent validity
than safety or capacity for all 20 engineers (Chi-squared = 37.053,
p < .001). Despite, or because of, the counterintuitive nature of this
result, it has a claim to our attention, for it is general across three
methods and holds against two other concepts. Similar questions of
performance convergent validity can be addressed to methods. For
examples, see Hammond, Hamm, and Grassia (1984).
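The comparison in Hypothesis 3 can be sketched as follows. Each engineer's performance correlations (judgment versus criterion, one per method) are stored per concept; the concept with the highest mean Fisher z is that engineer's best-validated concept, and the tally across engineers is what the chi-squared reported above is computed on. The data layout and function names are hypothetical, since the individual performance correlations are not reproduced in this section.

```python
import math
from collections import Counter

def mean_z(rs):
    """Mean of Fisher z-transformed correlations (z = atanh r)."""
    return sum(math.atanh(r) for r in rs) / len(rs)

def best_concept(perf_rs):
    """perf_rs maps concept -> list of performance correlations, one per
    method. Returns the concept with the highest mean z across methods."""
    return max(perf_rs, key=lambda concept: mean_z(perf_rs[concept]))

def count_best_concepts(engineers):
    """engineers: one perf_rs dict per engineer. Counts, for each concept,
    how many engineers had it as their highest-validity concept."""
    return Counter(best_concept(perf_rs) for perf_rs in engineers)
```

If every engineer's dictionary yields "aesthetics", the tally is 20 of 20, the pattern reported for Hypothesis 3.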

Discriminant validity. The critical question is whether the
concepts in question are discriminable by the subjects. Again we
inquire into the relative discriminability of the concepts of interest
with regard to performance, thus:

H4:  All  pairs  of  concepts  are  equally  discriminable. 

Hypothesis 4 can be tested by calculating an index of
discriminability for each concept pair for each engineer, determining
the order of these indices for each engineer, and seeing whether any
particular order occurred in a significant number of engineers. The
calculations in the first step of this procedure were illustrated above
in the discussion of the artificial engineer's performance in
discriminating between pairs of concepts (see Table 3). The procedure
is carried out for each engineer, producing a measure of how well he
discriminates each pair of concepts over all three methods. The safety
and capacity concepts were least accurately discriminated for 18 of the
20 engineers (Chi-squared = 26.4, p < .001). That is, 18 of the 20
engineers underdiscriminate safety and capacity, a result that is
consistent with the results from the coherence validity analysis.

Summary

In this section we have illustrated the application of the
multiconcept-multimethod analysis to topics typically of concern to
experimental psychologists, namely, testing general propositions
regarding behavior. In the examples given above, the convergent and
discriminant validity of three concepts was tested across three methods
within the same study using the same subjects. This was done by using
the coherence validity matrix, which is concerned solely with the
relations among different judgments of the concepts obtained under
different methods, and the performance validity matrix, which is
concerned with the relation between the experts' judgments and the
criterion measures of the concepts.

Our  illustration  highlights  the  complementarity  of  these  two 
analyses.  We  found  in  both  the  coherence  and  performance  validity 
analyses  that  the  aesthetics  concept  has  the  highest  convergent 
validity,  and  that  safety  and  capacity  are  least  discriminable  from  each 
other.  The  performance  validity  analysis  was  able  to  put  this  last 
finding  in  sharper  perspective  than  could  the  coherence  validity 
analysis  alone.  It  showed  that  the  experts  underdiscriminate  safety  and 
capacity  in  comparison  with  the  intercorrelation  between  the  criterion 
measures  of  these  concepts.  The  engineers'  judgments  of  these  two 
concepts  are  most  highly  correlated,  while  in  fact  the  criterion 
measures  of  these  concepts  have  the  lowest  intercorrelation.  This  could 
have  been  otherwise;  that  is,  if  the  criterion  measures  of  safety  and 
capacity  had  actually  been  very  highly  correlated,  the  engineers  might 
have  overdiscriminated  them.  The  performance  validity  analysis  provides 
the  only  way  to  determine  which  of  these  possibilities  is  true. 

This approach to establishing the generality of results from
experimental research can best be understood by contrasting it with
conventional methods.

Conventional  Methods  of  Generalization 

In spite of their many elaborations, conventional methods omit the
crucial comparisons of convergent and discriminant validity and thus
claim cumulative results without sufficient analytical justification.
For an example that typifies the largely intuitive method of
accumulating results in conventional practice, we consider Anderson's
discussion (1985, pp. 110-112) of experimental evidence for differential
memory of abstract and concrete knowledge. Anderson cites two
experiments, one regarding the retention of perception-based knowledge
and one regarding the retention of meaning-based knowledge. He
concludes that the results confirm each other and thus justify the
generalization that "we remember abstract information, not details" (p.
112). Anderson's method of accumulating results is typical in that it
rests on the observation that one experimenter has used one method to
test the validity of a hypothesis, a second experimenter has used a
different method to test the same hypothesis, and similar results have
been obtained; general knowledge is therefore claimed. But examination
of the experimental results from the standpoint of the MTMM methodology
shows why this claim is not justified.

The first experiment cited by Anderson (1985) was reported in an
article by Posner (1969) in which the difference in reaction time to an
"identity match" and a "name match" of letters was taken as a measure of
differential retention of information. Anderson points out that
"initially there is a large advantage [i.e., shorter reaction time] for
the identity match but after a two-second [inter-stimulus] interval this
advantage has almost completely disappeared. This alteration indicates
that memory for the initial stimulus is rapidly transformed into an
abstract code that does not retain specific visual information" (pp.
110-111).

The second experiment cited was carried out by Anderson (1974). He
remarks that his experiment "makes the same point in the verbal domain
as Posner's did in the perceptual domain" (p. 111). In Anderson's
(1974) study, as in Posner's, there is the possibility of a specific
information match and an abstract code match. Anderson's study differs
from Posner's in that the match occurs in the context of choosing among
the logical implications of critical sentences in a story. It is
similar, however, in that reaction time was also used to evaluate
differences between response categories that are analogous to Posner's
identity match and name match. Differences in reaction time are again
observed to be related to the length of time between presentations of
the stimuli (inter-stimulus intervals). Anderson (1985) concludes that
his results confirm Posner's, and summarizes with the following
generalization: "So, it seems that verbal information, like visual
information, tends to be short-lived and that after delays we mainly
remember abstract information" (p. 112).

But the conclusion is unjustified; the aggregation process is
analytically incomplete because neither convergent validity nor
discriminant validity was established in either experiment. Moreover,
the meaning of the principal concept (memory for abstract versus
concrete materials) is exhausted in both cases by the same single
operation (reaction time in relation to inter-stimulus interval), thus
offering us no evidence that the result is not confined to the reaction
time measure.

The incompleteness of conventional practices for aggregating
results can be seen when the Posner and Anderson studies are represented
in the multiconcept-multimethod matrices of coherence and performance
(see Table 4). Even if we assume that (a) the abstract-concrete
dimension can be separated into two concepts and (b) Posner and
Anderson's studies represent two different methods, we find that, of the
four cells required, only one cell related to discriminant validity can
be filled in for each study; no evidence of convergent validity (i.e.,
validity independent of method) can be provided. Separately, then, each
study is incomplete.
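The count of four discriminant cells follows from the layout of the matrix. A minimal sketch, under the same two assumptions as the text (two concepts, and the two studies treated as two methods), enumerates the off-diagonal cells. The two heteroconcept-monomethod and two heteroconcept-heteromethod cells are the four discriminant comparisons; a single-method study can supply only the monomethod cell for its own method, and the two convergent cells require the same concept measured by both methods. The function is illustrative, and the reliabilities on the diagonal are not counted.

```python
from itertools import combinations

def classify_cells(concepts, methods):
    """Count the off-diagonal cells of a multiconcept-multimethod correlation
    matrix by type. Each measure is one (concept, method) combination; the
    diagonal (reliabilities) is not counted."""
    measures = [(c, m) for m in methods for c in concepts]
    counts = {"convergent": 0,                  # same concept, different method
              "heteroconcept-monomethod": 0,    # different concept, same method
              "heteroconcept-heteromethod": 0}  # different concept and method
    for (c1, m1), (c2, m2) in combinations(measures, 2):
        if c1 == c2:
            counts["convergent"] += 1
        elif m1 == m2:
            counts["heteroconcept-monomethod"] += 1
        else:
            counts["heteroconcept-heteromethod"] += 1
    return counts

print(classify_cells(["abstract", "concrete"], ["Posner", "Anderson"]))
# {'convergent': 2, 'heteroconcept-monomethod': 2, 'heteroconcept-heteromethod': 2}
```

For the engineers study (three concepts by three methods) the same function gives 9 convergent, 9 heteroconcept-monomethod, and 18 heteroconcept-heteromethod cells.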


Insert  Table  4  about  here 


A performance validity matrix cannot be constructed for either
study because of the restriction of the measurement of the
abstract-concrete dimension in both studies to reaction time. That is,
although monoconcept heteromethod correlations can be calculated (see
the top row of the performance validity matrix), these results are
limited to the single criterion used; no data are available for other
measures of performance such as, for example, accuracy of recall, which,
after all, is what the generalization makes claims about.

If, however, we relax conditions further and pretend that each
subject participated in all conditions, then we find that a hypothetical
matrix would permit the examination of both discriminant and convergent
validity in the coherence validity matrix. By this procedure each cell
in the matrix is filled (if we further assume that reliabilities were
calculated). That is, all the question marks in Table 4 would be
removed. But the use of the same criterion measure (reaction time) in
both studies makes it impossible to use the hypothetical aggregation to
contrast performance with regard to memory for detail or for
abstractions, even if the same subject participated in all conditions.

Anderson (1985) cites a third study, however, carried out by
Kolers (1979), that does use a direct criterion (accuracy of recall)
for evaluating the retention of pictorial material.
Unfortunately for Anderson's generalization, Kolers (1979) found results
opposite to those obtained by Posner and Anderson. Anderson states:
"In a series of clever experiments, Kolers (1979) has shown that
under appropriate conditions we can retain visual details about the
typography of a page of print for months!" (Anderson, 1985, p. 112).
Because Kolers used a criterion of performance different from that
used by Posner and Anderson, we are left with two contradictory
performance validity matrices. But since each is incomplete, we cannot
be certain which conclusion should be denied or accepted.

What, then, are we to conclude, other than that sometimes one gets
one result, sometimes another? As matters stand, we cannot reconcile
the contradictory results. Nor can we decide what experiment to do next
without an analytical framework that indicates what information is
needed to disconfirm these and other plausible alternative hypotheses.
The coherence and performance matrices, however, make the requirements
of future studies obvious because they specify which cells must be
filled in order to defend the generalization. (See Farell, 1985, for a
description of the wide variety of methods and concepts that must be
considered in relation to "same"-"different" judgments.)

Our aim, of course, is not to single out for criticism the above
studies or Anderson's way of reporting them. Rather, it is our
intention to illustrate the largely intuitive, analytically incomplete,
conventional method of aggregation and to urge its replacement by the
analytical method described here, or by better ones (see Meehl, 1978,
for severe criticism of the conventional methods of cumulating results
and a recommendation for improvement). But since psychologists' (and
other scientists') cognitive strategies for asserting confirmation
and/or generalization of results are largely intuitive, they can be
studied, and thus described, from the standpoint of judgment and
decision theory (Hammond et al., 1980; Einhorn & Hogarth, 1981; Pitz &
Sachs, 1984), and discrepancies between scientists' cognitive strategies
and normative procedures could be examined with considerable profit, no
matter what the results turned out to be. Tversky (1977), Tversky and
Gati (1982), and Gati and Tversky (1984) have shown that judgments of
similarity and dissimilarity, which, of course, are at the root of
scientists' judgments of confirmation and/or generalization, can be
analyzed and understood in terms of the relative weights attached to
common and distinctive features of various entities. (Other methods are
described in Hammond et al., 1980.) Such studies would provide
descriptions of scientists' judgment processes, which could then be
compared to prescriptive (normative) means of aggregating results, and
thus enable us to discover the nature and extent of the differences
between them. But in order to accomplish that goal, a prescriptive,
normative methodology such as we have described here must be provided;
no other exists at present.¹

Summary  and  Discussion 

As  several  psychologists  have  observed,  psychological  research 
lacks  the  cumulative  character  critical  to  the  development  of  a  science. 
In  any  such  circumstance  suspicion  would  arise  that  the  scientific 
discipline  in  question  is  the  captive  of  a  flawed  theoretical  or 
methodological  dogma.  Since  theories  are  numerous  in  psychology,  but 
methodology  is  uniform  throughout  graduate  schools  and  journal  reviews, 
dogmatic  methodology  must  be  the  prime  suspect. 

We extended and integrated the pioneering efforts of Campbell and
Fiske (1959) and Brunswik (1956) in order to replace the current
judgment-based method with an analytically based methodology for
achieving generalization over conditions as well as subjects. We then
presented an example of how generalization over methods and concepts can
be obtained by the use of the coherence and performance validity
matrices.

The logic of the coherence validity matrix is based on what Feigl
called "triangulation in logical space" (Feigl, 1958; see also Campbell
& Fiske, 1959, p. 84). From a logical point of view, the methods and
concepts selected for study should be completely independent; the
"triangulation" should approximate a right triangle as nearly as
possible. Thus, Campbell and Fiske (1959) discuss "convergence of the
independent methods" and cite Cronbach and Meehl's argument that the use
of "diverse criteria give[s] greater weight to the claim of construct
validity than do ... predictions of very similar behavior" (Cronbach &
Meehl, 1955, p. 295).

Brunswik, however, emphasized the fact that the ecological
variables that so often serve as criteria for psychologists' concepts
are not independent, i.e., orthogonal to one another, in the organism's
natural habitat. At the very least, such independence should not be
taken for granted and uncritically made the essential design feature of
every experiment. Therefore, from the researcher's point of view,
Feigl's concept of "triangulation in logical space" is not to be seen as
a goal, but as a condition that serves didactic purposes, without regard
to the demands of specific problems. The proper goal for the researcher
(in contrast to the logician) is "triangulation in empirical space," in
which the logician's worship of orthogonality is replaced by the
researcher's worship of empirical generalization. Informative as the
logician's remarks undoubtedly are, the proper goal of basic research is
generalization of results. That goal can best be achieved through the
use of "representative triangulation," and through the use of a
performance validity matrix, in experiments as well as in studies of
individual differences. We demonstrated how the conventional method of
cumulating results is analytically incomplete, and thus largely
intuitive. Therefore, unjustified claims of generality are to be
expected.

But representative triangulation will require a substantial shift
in methodology that takes cognizance of (a) the congruence between the
strategy of conventional experimental designs and the aims of applied
research, and (b) the incongruence between conventional experimental
designs and the aims of basic research. Conventional research does not
demand that claims of generalization over conditions or "treatments" be
analytically justified (as our example showed). It does demand that
generalization over subjects be justified (as the plethora of
statistical tests over populations demonstrates). Justification of
generalization over subjects is to be expected in applied research,
especially applied agricultural research, which is the source of
conventional designs in experimental psychology (cf. Newell & Simon,
1972, who make a similar observation). That is because applied
agricultural research is uninterested in generalizing over conditions;
once the treatment effects have been established, that is, found to be
general over the relevant subjects (plants or animals), the farmer can
control future conditions, i.e., apply only the "treatment" that works.
Thus, the research user gets the information s/he wants. But
controlling conditions outside the experiment is precisely what basic
research psychologists cannot do. In lieu of applying the successful
treatments as the farmer does, they must assert, by intuitive judgments
of similarities and dissimilarities between different laboratory
experiments, that the results provide evidence of confirmation and/or
generalization. Thus, basic researchers use exactly the wrong strategy,
namely, fixed conditions and general subjects, a strategy requiring
generalization by judgments about what constitutes confirmation and/or
generalization. Although considerable progress in understanding the
nature of scientists' "generalization by judgment" might well be
achieved by means of the various methods of judgment analysis, basic
researchers should employ a strategy appropriate to basic research,
together with an analytical method for justifying claims of generality.
One such analytical method is described here.

APPENDIX A

Procedure  for  Combining  Indices  of  Validity 


The  various  indices  (e.g.,  of  internal  discriminant  validity, 
external  validity,  or  overall  validity)  are  produced  by  taking  the  mean 
of  the  appropriate  subindices  (e.g.,  the  first  measure  of  internal 
discriminant  validity,  or  external  convergent  validity)  according  to  the 
pattern  illustrated  in  Figure  A-1.  Each  subindex  is  produced  for  each 
engineer  by  taking  the  mean  of  z-transformed  correlations,  from  specific 
locations  in  the  internal  or  external  validity  matrices,  or  the  mean  of 
the  differences  between  such  z-transformed  correlations,  corresponding 
to  the  comparisons  that  were  illustrated  above  with  Hypotheses  1-4. 
Table  A-1  displays  the  formulas  for  each  of  the  9  subindices,  at  each  of 
6 possible levels of aggregation. For example, the formula for the
internal convergent validity index, at the concept level of aggregation,
is

$$\operatorname*{M}_{j \neq k} \, r_{m_j m_k}$$

This index is calculated for each concept m. It is the mean, over all
pairs of methods j and k where j is different from k, of the
z-transformations of $r_{m_j m_k}$, which is the correlation between two
judgments of concept m, using method j and method k. The correlations
for the external validity matrix are (with one exception) of the form
$r_{m_0 n_j}$, that is, the correlation between the criterion measure of
concept m and the engineer's judgment of concept n using method j. M is
used as a "mean" symbol, representing a sum of correlations divided by
the number of correlations summed over. The correlations involved in
producing all the subindices in this table have been z-transformed.


Insert  Figure  A-1  and  Table  A-1  about  here 


Once the subindices are calculated as in Table A-1, they are combined
as indicated in Figure A-1. Thus, the mean of the three internal
discriminant validity subindices (IDV1, IDV2, and IDV3) is the index for
internal discriminant validity (IDV); the mean of IDV and the internal
convergent validity index (ICV) is the index for coherence or internal
validity (IV); and the mean of IV and the index for performance or
external validity (EV) is the index for overall competence. Further
discussion can be found in Hammond, Hamm, and Grassia (1984).
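The combination rules in this paragraph amount to nested unweighted means, the equal weighting that question 2 in the main text asks about. A minimal sketch follows; the function and key names are hypothetical, and EV is passed in already computed because its own subindices are defined in Table A-1 rather than here.

```python
from statistics import mean

def competence_index(idv_subindices, icv, ev):
    """Combine indices following the pattern described for Figure A-1:
    IDV = mean of the internal discriminant subindices,
    IV  = mean(IDV, ICV)   (coherence),
    competence = mean(IV, EV), with EV (performance) supplied precomputed."""
    idv = mean(idv_subindices)
    iv = mean([idv, icv])
    return {"IDV": idv, "IV": iv, "competence": mean([iv, ev])}
```

A weighted scheme, if one were preferred, would replace the final `mean([iv, ev])` with a weighted average; the text leaves that choice as an open question.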
Footnote

¹Meta-analysis (see, for example, Light & Pillemer, 1984) does not
meet our criteria for aggregating results of experiments because it does
not require a distinction between studies that establish discriminant
and convergent validity and those that do not.


Performance  Validity  Multiconcept-Multimethod  Matrix  for  Artificial 
Table 4

Coherence and Performance Validity Matrices for Posner and Anderson Studies

Coherence Validity Matrix

Posner (Perception)    Anderson (Verbal)

Performance Validity Matrix

Criterion 1 (RT/ISI)
Criterion 2 (Accuracy)

Key:

a : Reliabilities
b : Discriminant validity provided by Posner study
c : Discriminant validity provided by Anderson study
d : Heteromethod-heteroconcept discriminant validities
e : Convergent validities
f : Performance validities


Figure Captions

Figure  1.  Design  of  the  highway  engineers  study. 

Figure  A-1.  The  structure  of  indices  representing  coherence, 
performance  and  overall  competence. 